1. INTRODUCTION

This research aims to predict crop yield in regional areas of India. It focuses on a benchmark dataset
whose attributes are required to ascertain the crop yield in agricultural areas of Karnataka. The information
collected from Gandhi Krishi Vigyan Kendra (GKVK), Bangalore, provides the attributes of agricultural land
used as input. The machine learning (ML) model considers the actual crop yield of the past 5 years together
with the attributes listed in Table 1. This information helps in estimating the crop yield in the present year
from the quality of soil in the region and the amount of rainfall recorded in the dataset. Crop yield
prediction [1]-[6] can also be carried out for subsequent years, and the temperature and humidity of the soil
make a significant contribution to the system. The machine learning methods in [7]-[11] aim to identify the
crop yield in various regions by applying the proposed methods to a benchmark dataset. The dataset includes
soil, temperature, humidity, pH value, and various other criteria that help the system predict the percentage
of crop yield in regional areas of the Indian plateau. Figure 1 indicates the strategy used to classify the
data of a dataset into different classes of information.

Machine learning techniques are usually classified into supervised and unsupervised techniques. 
Supervised machine learning starts from prior knowledge of the desired result in the form of labeled
datasets, which guides the training process as per [12]-[16], whereas unsupervised machine learning
works directly on unlabeled data. In the absence of labels to orient the learning process, these labels must be
“discovered” by the learning algorithm [1]. In this technical report, we discuss the desirable features of good
clustering [17]-[24] results, recall Kleinberg’s impossibility theorem for clustering, and describe a taxonomy
of evaluation criteria for unsupervised machine learning. We also survey many of the evaluation metrics that 
have been proposed in the literature. We end our report by describing the techniques that can be used to 
adjust the parameters of clustering algorithms, i.e. their hyper-parameters. 


Table 1. Parameters used for prediction of crop yield 


Sl. No  Parameter used                                  Percentage of nutrients in soil  Times used in research data
1       Temperature                                     23 °C                            24
2       Soil type                                       Black                            17
3       Rainfall                                        25 cm                            17
4       Crop information                                2 times/year                     13
5       pH value                                        83                               11
6       Humidity                                        29                               11
7       Area of production                              2                                8
8       Fertilization                                   31                               7
9       Normalized difference vegetation index (NDVI)   56                               6
10      Nitrogen                                        19                               6
11      Potassium                                       73                               5
12      Zinc                                            44                               3
13      Magnesium                                       18                               3
14      Sulphur                                         91                               2
15      Boron                                           86                               2
16      Calcium                                         93                               2
17      Carbon                                          86                               2
18      Phosphorous                                     74                               2
19      Climate                                         Rainy                            1
20      Time                                            6 months                         1
21      Manganese                                       63                               1


Figure 1. Architecture of proposed correlation of similarity learning for crop yield prediction 


2. RELATED WORK 

The research was carried out by identifying gaps in existing studies on processing soil data and
classifying it into different classes with evolved algorithms. These methods motivated the proposed research
to develop a distance-metric-based algorithm for extracting soil data and classifying it on a benchmark
dataset along with a customized dataset [5], [6]. The studies in [22]-[28] evaluate the process of creating and
selecting attributes with machine learning methods for the classification of data items. These papers throw
light on areas of machine learning that are helpful for processing data items and classifying them into classes
such as eroded soil or soil suitable for cultivation. These ML techniques have shown a clear advantage on a
specific dataset collected from Kaggle and have set the direction for how machine learning methods are
applied to classification [7]-[11], [13]-[17], [29]-[33]. Research papers such as [3]-[6], [12]-[14] have
thrown light on agriculture, classifying soil data into different classes with machine learning and deep
learning. The research in [6], [12], [13] presents features that are helpful in training the system to extract
data and classify it into different classes. Concluding remarks on processing the features of data items and
segregating the data into suitable classes are drawn in [14]-[16], [29]. That research also provides important
information on processing and segregating data into different classes with ML. A survey of various data
mining techniques led to the proposed method for classification of soil data as per [18], [20]-[22].


3. PROPOSED ALGORITHM

The statistical approach employed in this research clusters the data items into groups based on a
similarity measure and predicts the percentage of crop yield in various regions, as described in the algorithm
statistical learning for crop yield prediction (SL-CYP). Compared with other contemporary approaches, the
algorithm is more efficient at pushing similar data samples close to each other and pulling the remaining
samples far away from one another according to threshold values. SL-CYP has yielded good performance in
measuring the crop yield in regional areas of India.


Algorithm: predicts the crop yield from attributes 

Algorithm: statistical learning for crop yield prediction (SL-CYP) 
Description: predicts the crop yield with similarity measure 
Output: yields the classification results with performance 


Begin 
Step 1 [Initialization] 
Solve (1) 
Step 2 [Parameter Tuning] 
Step 2.1 Solve (2), (3) and (4) 
Step 2.2 Combine Step 2.1 
Step 3 [Optimization] 
Step 3.1 
Step 3.2 Solve (5), (6) and (7) 
Step 3.3 Combine (8) 
Step 3.4 Repeat Step 2 and Step 3 
Subtract (9) from step 2.2 and 3.2 
End 


4. METHOD


The proposed method involves several phases for predicting crop yield in various regions: data
collection, pre-processing, processing of data with statistical information, and feature extraction for training
the system to learn how data items are to be predicted or classified into different classes based on the
information they contain. Crop yield prediction is achieved with the help of nutrient information and its
abundance across a wide range of locations, which is helpful to farmers where good nutrients are present in
the soil.


4.1. Collection of data 

The research focuses on predicting crop yield in various regions of the Indian plateau. These regions
differ significantly in rainfall and temperature, along with various other parameters such as nutrients in soil.
A farmer's crop yield prediction can be assessed by analyzing the moisture content of the soil alongside the
type of crop suitable for specific regions of the Indian plateau. Figure 2 indicates the regions of the Indian
plateau, consisting of different states such as Karnataka and others, each suitable for a particular type of
soil. The experimental conclusions of this research are drawn from information collected from the
agricultural department of Karnataka. The soil quality and its suitability in various regions are predicted
with the statistical approach incorporated in the algorithm. Figure 2(a) indicates the nation whose soil
details are considered for research purposes. Similarly, Figure 2(b) presents the region within India
(Karnataka) from which the soil details are taken for processing, and Figure 2(c) presents the nutrients of
soil in regions of Karnataka used as a reference for assessing the soil nutrients and their usefulness in
predicting crop yield in these regional areas.


4.2. Data pre-processing 

Data pre-processing involves cleaning the data to remove redundant entries. Repeated or blank data
is cleaned and normalized to a standard format before the crop yield in regional areas of the Indian plateau
is estimated. Pre-processing also handles missing data items in the attributes of the dataset, which was one
of the challenging issues faced in this research. Missing data is eliminated by ignoring the tuples in which
attribute values are missing. The second approach employed in this research is cleaning noisy data from the
attributes of data items, for which we use binning. Binning not only removes the noise present in attributes,
it also smooths the fluctuations present in them.
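
The cleaning steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this research: the function names, the toy rows, the use of equal-frequency bins smoothed by bin means, and min-max scaling as the "standard format" are all hypothetical choices, since the text does not specify them.

```python
from statistics import mean


def drop_incomplete_and_duplicates(rows):
    """Ignore tuples with missing values (None or '') and exact repeats, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None or value == "" for value in row):
            continue  # tuple with a missing attribute value is ignored
        key = tuple(row)
        if key in seen:
            continue  # repeated tuple is dropped
        seen.add(key)
        cleaned.append(key)
    return cleaned


def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning (assumed variant): each value is replaced by its bin mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = [0.0] * len(values)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        bin_mean = mean(values[i] for i in members)
        for i in members:
            smoothed[i] = bin_mean
    return smoothed


def min_max_normalize(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Hypothetical (temperature, rainfall) tuples; not taken from the GKVK dataset.
rows = [(23.0, 25.0), (23.0, 25.0), (None, 30.0), (27.0, 21.0), (31.0, 28.0), (22.0, "")]
cleaned_rows = drop_incomplete_and_duplicates(rows)
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
```

Smoothing by bin means is only one binning variant; bin medians or bin boundaries would fit the description equally well.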


Figure 2. Maps representing the regions of India which includes Karnataka and the nutrients used for
prediction of crop yield: (a) indicates the geographic map of India, (b) represents the geographic map of
Karnataka, and (c) indicates the parameters considered for prediction of soil nutrients


4.3. Expansion of patterns for similarity with clustering 

The attributes of the dataset are explored to measure patterns of similarity and to compare
correlated data items with similarity metrics. The similarity metrics also help the system identify attributes
that expose hidden similarity with other attributes in support of clustering. The proposed research also
focuses on identifying attributes that are significant in terms of distance metrics as well as patterns of data
items. The similarity metrics may be expressed as distances, such as the Euclidean distance used in metric
learning, together with other aspects of the patterns of data items. The patterns of data items are used in
combination with distance metrics and pattern matching. Patterns that are similar to each other up to a
certain threshold are pushed closer to each other, and patterns whose similarity is below the threshold are
pulled as far apart as possible to improve the clustering of data items.

The proposed research has thrown light on patterns of similarity that may be established with
distance metrics such as Euclidean distance. Euclidean distance metric learning has yielded good
classification accuracy alongside clustering. Equation (1) is significant for measuring the similarity patterns
of data items. The system pushes the attributes of data items whose similarity exceeds the threshold closer
to each other and pulls the attributes that are dissimilar far away from one another.


$$d(p, q) = \sqrt{(p_2 - p_1)^2 + (q_2 - q_1)^2} \tag{1}$$
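
As a rough illustration of how the Euclidean distance in (1) can drive threshold-based grouping, the sketch below places a sample in an existing group when it lies within a threshold of that group's first member. The greedy grouping rule, the function names, and the sample points are assumptions made for illustration only; the text does not state how the threshold is applied.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two points, e.g. (p1, q1) and (p2, q2) as in (1)."""
    return math.dist(a, b)


def group_by_threshold(samples, threshold):
    """Greedy grouping (assumed rule): join the first group whose first member is close enough."""
    groups = []
    for sample in samples:
        for group in groups:
            if euclidean_distance(sample, group[0]) <= threshold:
                group.append(sample)
                break
        else:
            groups.append([sample])  # no group is close enough, so start a new one
    return groups


# Hypothetical two-attribute samples; values are illustrative only.
samples = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
groups = group_by_threshold(samples, threshold=2.0)
```

With these points the first two samples fall into one group and the last two into another, because only those pairs lie within the chosen threshold.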


The proposed research is formulated to optimize the measurement of pattern similarity through (2) to (5).


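$$\max S(A) = S_1(A) + S_2(A) - S_3(A) \tag{2}$$
$$S_1(A) = \frac{1}{Mk} \sum_{m=1}^{M} \sum_{t_1=1}^{k} (p_m - q_{m t_1})^T W W^T (p_m - q_{m t_1}) \tag{3}$$
$$= \operatorname{tr}\left( W^T \frac{1}{Mk} \sum_{m=1}^{M} \sum_{t_1=1}^{k} (p_m - q_{m t_1})(p_m - q_{m t_1})^T W \right) \tag{4}$$
$$= \operatorname{tr}(W^T O_1 W) \tag{5}$$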


Similarly, (3) may be written for the second term to identify the pattern similarity of data items in the form
of (6), which simplifies to (7) and (8).


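$$S_2(A) = \frac{1}{Mk} \sum_{m=1}^{M} \sum_{t_2=1}^{k} (p_{m t_2} - q_m)^T W W^T (p_{m t_2} - q_m) \tag{6}$$
$$= \operatorname{tr}\left( W^T \frac{1}{Mk} \sum_{m=1}^{M} \sum_{t_2=1}^{k} (p_{m t_2} - q_m)(p_{m t_2} - q_m)^T W \right) \tag{7}$$
$$= \operatorname{tr}(W^T O_2 W) \tag{8}$$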


Similarly, the data items of the dataset need to be trained with other parameters. The patterns are
tuned by keeping one variable constant while the other is tuned, and then keeping the other parameter
constant while the present variable is tuned. This extracts the similarity of patterns suitable for parameter
optimization and helps the system learn which features of similar patterns are to be pushed closer to each
other for clustering and which dissimilar patterns are to be pulled far away from one another.


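$$S_3(A) = \operatorname{tr}\left( W^T \frac{1}{M} \sum_{m=1}^{M} (p_m - q_m)(p_m - q_m)^T W \right) \tag{9}$$
$$= \operatorname{tr}(W^T O_3 W) \tag{10}$$
$$\max S(W) = \operatorname{tr}\left[ W^T (O_1 + O_2 - O_3) W \right] \tag{11}$$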


The optimization function helps the system bring similar features closer to each other, and its solution is
obtained from the eigenvalue problem in (12).


$$(O_1 + O_2 - O_3)\, w = \lambda w \tag{12}$$
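
The eigenvalue problem in (12) can be solved numerically as sketched below with NumPy. This is an illustrative sketch, not the authors' code: it assumes that each of the three matrices is an averaged sum of outer products of difference vectors, that the columns of W are constrained to be orthonormal, and that W is formed from the eigenvectors with the largest eigenvalues. The helper names and the random difference vectors are hypothetical.

```python
import numpy as np


def scatter_matrix(diffs):
    """Average outer product of the row vectors in diffs: (1/n) * sum_i d_i d_i^T."""
    diffs = np.asarray(diffs, dtype=float)
    return diffs.T @ diffs / diffs.shape[0]


def solve_projection(o1, o2, o3, n_components):
    """Return the leading eigenvectors and eigenvalues of O1 + O2 - O3, as in (12)."""
    h = o1 + o2 - o3
    h = (h + h.T) / 2.0  # enforce exact symmetry before calling eigh
    eigvals, eigvecs = np.linalg.eigh(h)  # eigenvalues come back in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top], eigvals[top]


# Hypothetical difference vectors standing in for the terms of S1, S2 and S3.
rng = np.random.default_rng(0)
diffs_1 = rng.normal(size=(40, 5))
diffs_2 = rng.normal(size=(40, 5))
diffs_3 = 0.1 * rng.normal(size=(20, 5))

o1, o2, o3 = (scatter_matrix(d) for d in (diffs_1, diffs_2, diffs_3))
W, lambdas = solve_projection(o1, o2, o3, n_components=2)
objective = np.trace(W.T @ (o1 + o2 - o3) @ W)  # value of (11) at the solution
```

Under the orthonormality assumption, the value of (11) at the solution equals the sum of the selected eigenvalues, which gives a quick consistency check on the computation.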


5. RESULTS AND DISCUSSION 

The research has been implemented in Python. The implementation uses libraries such as NumPy
and Matplotlib, along with various other functions, to assess the performance of the system. The proposed
approach yields a classification accuracy of 89.62% on a benchmark dataset collected from GKVK
Bangalore. Despite the challenges faced in cleaning the data and finding similarities among attributes, the
approach achieves good performance measures, as shown in Figures 3 to 6.


5.1. Metrics used 

Three metrics are considered for assessing the prediction results of the proposed approach against
existing methods: precision, recall, and F-measure. Precision is calculated from two parameters, true
positives and false positives. Equation (13) paves the way to identify the performance of the proposed
method while predicting the result of clustering alongside classification accuracy. Similarly, recall in (14)
considers two parameters, true positives and false negatives, while the F-measure in (15) uses both
precision and recall to measure the prediction results of the proposed approach. Equations (13) to (15)
determine the performance of the system.


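$$\text{Precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}} \tag{13}$$
$$\text{Recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}} \tag{14}$$
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{15}$$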
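
The three metrics in (13) to (15) can be computed directly from the confusion counts, as in the short sketch below. The function names and the example counts are hypothetical and are not results from this research; returning zero when a denominator is zero is a convention assumed here.

```python
def precision(true_positive, false_positive):
    """Equation (13): TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    total = true_positive + false_positive
    return true_positive / total if total else 0.0


def recall(true_positive, false_negative):
    """Equation (14): TP / (TP + FN); returns 0.0 when there are no actual positives."""
    total = true_positive + false_negative
    return true_positive / total if total else 0.0


def f_measure(precision_value, recall_value):
    """Equation (15): harmonic mean of precision and recall."""
    total = precision_value + recall_value
    return 2 * precision_value * recall_value / total if total else 0.0


# Hypothetical counts for illustration only.
p = precision(true_positive=6, false_positive=2)
r = recall(true_positive=6, false_negative=4)
f = f_measure(p, r)
```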


5.2. Comparison of performance 

The performance of the proposed method, comparing actual data items against predicted results, is
significant in terms of precision and recall. The graphical representation of the results plots precision
against recall for analysis of the proposed method. The results indicate that the proposed method performs
better than the contemporary methods mentioned in related work [17]-[19], [29], [30].


[Plot: Precision vs. Recall (axes: Recall, Precision)]


Figure 3. Prediction result of proposed method with respect to ground truth data is shown for five attributes
like temperature, soil type, rainfall, crop information, and pH value

Figure 4. Prediction result of proposed method with respect to ground truth data is shown for five attributes
like humidity, area of production, fertilization, NDVI, and nitrogen

Figure 5. Prediction result of proposed method with respect to ground truth data is shown for six attributes
like potassium, zinc, magnesium, sulphur, boron, and calcium

Figure 6. Prediction result of proposed method with respect to ground truth data is shown for five attributes
like carbon, phosphorous, climate, time, and manganese


6. CONCLUSION 

The proposed research has produced good classification accuracy alongside clustering when its
performance is measured against existing methods. It yields a classification accuracy of 89.62% on a
dataset collected from the agricultural department at GKVK. The dataset is considered a benchmark
because it covers 21 attributes representing different conditions of agriculture, which we find useful for
ascertaining the crop yield. The proposed statistical features learning for crop yield prediction is
considered a state-of-the-art technique, as it provides good efficacy over the dataset and yields good
classification results in comparison with other methods.